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Situations of a functional predictor paired with a scalar response 
are increasingly encountered in data analysis. Predictors are often ap- 
propriately modeled as square integrable smooth random functions. 
Imposing minimal assumptions on the nature of the functional rela- 
tionship, we aim to estimate the directional derivatives and gradients 
of the response with respect to the predictor functions. In statistical 
applications and data analysis, functional derivatives provide a quan- 
titative measure of the often intricate relationship between changes 
in predictor trajectories and those in scalar responses. This approach 
provides a natural extension of classical gradient fields in vector space 
and provides directions of steepest descent. We suggest a kernel-based 
method for the nonparametric estimation of functional derivatives 
that utilizes the decomposition of the random predictor functions 
into their eigenfunctions. These eigenfunctions define a canonical set 
of directions into which the gradient field is expanded. The proposed 
method is shown to lead to asymptotically consistent estimates of 
functional derivatives and is illustrated in an application to growth 



1. Introduction. Situations where one is given a functional predictor and 
a continuous scalar response are increasingly common in modern data anal- 
ysis. While most studies to date have focused on functional linear models, 
the structural constraints imposed by these models are often undesirable. To 
enhance flexibility, several nonparametric functional regression approaches 
have been discussed. Since these models are not subject to any assumptions 
except smoothness, they are very widely applicable. The price one pays, of 
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course, is that convergence will be slower when compared with functional lin- 
ear models. The situation is comparable to that of extending ordinary linear 
regression to nonparametric regression. By abandoning restrictive assump- 
tions, such extensions greatly enhance flexibility and breadth of applicabil- 
ity. Under suitable regularity assumptions, convergence of such functional 
nonparametric models is guaranteed for a much larger class of functional 
relationships, and this insurance is often well worth the slower rates of con- 
vergence. 

Suppose we observe a sample of i.i.d. data {Xi,Yi), . . . ,{Xn,Yn), gener- 
ated by the model 



where X is a random function in the class L2{I) of square- integrable func- 
tions on the interval X= [0, 1], g is a smooth functional from L2{T) to the 
real line and e represents an error, independent of X, with zero expected 
value and finite variance. In the nonparametric approach, one aims to con- 
duct inference about g without imposing specific structure, usually that 
5 is a linear functional. The traditional functional linear model would have 
g{x) = a + J bx, where a is a constant and b a function, but even here the "re- 
gression parameter function" b cannot be estimated at the parametric rate 
n~^/^, unless it is subject to a finite-parameter model; this model has been 
well investigated in the literature. Examples of such investigations include 
Ramsay and Dalzell (1991), Cuevas, Febrero and Fraiman (2002), Cardot et 
al. (2003), Cardot, Ferraty and Sarda (2003), James and Silverman(2005), 
Ramsay and Silverman (2005), Yao, Miiller and Wang (2005b) and Hall and 
Horowitz (2007). 

While the functional linear regression model has been shown to provide 
satisfactory fits in various applications, it imposes a linear restriction on 
the regression relationship and, therefore, cannot adequately reflect non- 
linear relations. The situation is analogous to the case of a simple linear 
regression model where a nonparametric regression approach often provides 
a much more adequate and less biased alternative approach. Likewise, there 
is sometimes strong empirical evidence, for example, in the form of skew- 
ness of the distributions of empirical component scores, that the predictor 
function X is not Gaussian. The problem of estimating a nonparametric 
functional regression relation g in the general setting of (1) is more diffi- 
cult compared to functional linear regression, and the literature is much 
sparser. It includes the works of Gasser, Hall and Presnell (1998) and Hall 
and Heckman (2002) on the estimation of distributions and modes in func- 
tion spaces, and of Ferraty and Vieu (2003, 2004, 2006) on nonparametric 
regression with functional predictors. Recent developments are reviewed in 
Ferraty, Mas and Vieu (2007). 




Y = g{X)+e, 



ESTIMATION OF FUNCTIONAL DERIVATIVES 



3 



To lay the foundations for our study, we introduce an orthonormal basis 
for L2{I), say ^/^i, ■02, • • • , which, in practice, would generally be the basis con- 
nected to the spectrum of the covariance operator, V{s, t) = cov{X{s), X{t)}: 

oo 

(2) Vis,t)=Y,e,^^j{u)^lJjiv), 

i=i 

where the "0/8 are the orthonormal eigenfunctions, and the Oj^s are the 
respective eigenvalues of the linear operator with kernel V. The terms in (2) 
are ordered as 9i > 02 > ■ ■ ■ ■ The empirical versions of the ipj^s and 9j^s arise 
from a similar expansion of the standard empirical approximation V to V, 

n oo 

(3) vis,t) = -T.{Ms) - Xis)}{X,it) - Xit)} = Y,e^^,{s)i^^{t), 

1=1 J=l 

where X = Xi and order is now determined by > ^2 ^ ■ ■ ■ • The 

eigenvalues 6j vanish for j > n + 1, so the functions 'i/'n+i) V'n+2) • • • may be 
determined arbitrarily. 

The centered form of X admits a Karhunen-Loeve expansion 

00 

(4) X-E{X)=Y^i^i^j, 

j=i 

where the principal components = Jj Xipj are uncorrelated and have zero 
means and respective variances 9j. Their empirical counterparts are com- 
puted using ipj in place of ipj. 

The paper is organized as follows. In Section 2, we describe the kernel- 
based estimators that we consider for estimating the nonparametric regres- 
sion function g in model (1) on the functional domain and for estimating 
functional derivatives in the directions of the eigenfunctions V'j. In Section 3, 
rates of convergence for kernel estimators g of the nonparametric regression 
function g are obtained under certain regularity assumptions on predictor 
processes and their spectrum (Theorems 1 and 2). These results then lead 
to the consistency property (Theorem 3) for functional derivatives. A case 
study concerning an application of functional derivatives to the Berkeley lon- 
gitudinal growth study is the theme of Section 4, followed by a compilation 
of the proofs and additional results in Section 5. 

2. Proposed estimation procedures. Define the Nadaraya-Watson esti- 
mator 

... _ T.^Y,K,{x) 

where Ki{x) = K{\\x — Xi\\/h), is a kernel function and h a bandwidth. 
Here, || • || denotes the standard norm. Similar kernel estimators have 
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been suggested in the literature. We refer to Ferraty and Vieu (2006) for 
an overview regarding these proposals and also for the previously published 
consistency results for the estimation of g. While the focus of this paper is on 
the estimation of functional derivatives in the general framework of model 
(1), using the spectral decomposition for predictor processes X and charac- 
terizing these processes by their eigenbasis also leads to useful and relevant 
results regarding the estimation of g. These results are given in Theorems 1 
and 2 below, while Theorem 4 provides relevant bounds for the probability 
that X lies in a small ball, and Theorem 3 yields the desired asymptotic 
consistency of the proposed functional derivative estimator defined at (7). 

For simplicity, we shall suppose the following (although more general con- 
ditions may be imposed). 

Assumption 1. Kernel K is nonincreasing on [0,c], where c > 0, and 
the support of K equals [0, c] . 

The derivative of 5 at x is defined to be the linear operator gx with the 
property that, for functions y and scalars 5, 

g{x + 6y) = g{x) + dg^y + o{6) 

as 5^0. We may write 

00 

(5) 9x = ^lxjtj, 

i=i 

where '^xj = dxi^j is a scalar, and tj denotes the operator that takes y to yj = 
tj{y) = J yipj. We can think of ^xj as the component of gx in the direction 

From knowledge of the operator gx, accessible through the components 
7a;j, we can obtain information about functional gradients and extrema. For 
example, suppose a™™ = (a^i™, a™2™> • • •) and a™^'' = (a^f"", af2^, . . .) are de- 
fined as the vectors a = (01,02,...) that, respectively, minimize and maxi- 
mize \gxO-\, where 

00 

(6) gxa = ^7xjaj, 

i=i 

over functions o = J^j'^j'^j fo^' which ||a|| = 1 (i.e., such that J^j'^'j — !)• 
Then, the function g changes fastest as we move away from x in the direction 
of o™^^ = a^j^^ipj, which, therefore, is a gradient direction. The function 
changes least when we move from x in the direction of o™™ = a^j^ipj . 
Extremal points are characterized by 'jxj = for all j, and their identification 
is of obvious interest to identify predictor functions associated with maximal 
or minimal responses, and also the level of these responses. 
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Thus, the components ^xj are of intrinsic interest. As a prelude to esti- 
mating them, we introduce li^jj = Yi^ — Yi^ and ^iiiaj = /j(^n — 
the latter being an empirical approximation to Cjii2i — ^hj ~ ^120 (i-^-i ^o 
the difference between the principal components ^ij = J Xiipj for i = 11,12)- 
Define 

^ _ 1 _ I li-^h ~ ^^2)V'iP _ 1 ^hi2j 

Wiii2j — \\ V TA 119 — 



i«2J ^ WY — Y \\Y — Y W^^ 

W^il ^42 11 ^42 II 

which represents the proportion of the function — Xjj, that is, "not 
aligned in the direction of ipj." Therefore, Qi^i2j will be small in cases where 
Xi-^ — is close to being in the direction of ipj, and will be larger in other 
settings. We suggest taking 

EES2^ni2^(n,*2,j» 



(7) %j 



EE\l]i2^iii2jK{ii,i2j\x) 



Here, X^Slf^ia denotes summation over pairs (^1,^2) such that ^i^iaj > 0, 
(8) j^(zi,i2,i|x)=Kf "'^~/^-'V f "''~/^^'V ^^'^^'' 



hi J \ hi J \ h2 

is a kernel function and hi and h2 denote bandwidths. On the right-hand 
side of (8), the last factor serves to confine the estimator's attention to pairs 
(^1,^2), for which Xjj — Xi^ is close to being in the direction of ipj, and the 
other two factors restrict the estimator to ii and 12, such that both and 
Xi^ are close to x. The estimator ■j^j uses two smoothing parameters, hi 
and /i2- 

3. Theoretical properties. 

3.1. Consistency and convergence rates of estimators of g. To ensure 
consistency, we ask that the functional g be continuous at x (i.e., that for 
functions y and scalars 6, the following holds). 

Assumption 2. 

(9) sup \g{x + 5y) — g{x)\ ^ as (5 J, 0, 

y- lly||<i 

and the bandwidth h does not decrease to zero too slowly, in the sense that, 
with c as in Assumption 1, 

/i = /i(n) — > and nP{\\X — x\\ < cih) ^ 00 asn— >-oo, 

(10) 

where ci = c ii K{c) > 0, and otherwise ci G (0, c). 
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Given C > 0, x S L2(I) and a S (0, 1], let Q{C,x,a) denote the set of 
functionals g such that \g{x + 6y) — g{x)\ < C6°^, for all y G L2{I) satis- 
fying ||y|| < 1, and for all < 5 < 1. When deriving convergence rates, we 
strengthen (9) by asking that g be in Q{C,x^a). 

Let X = {Xi, . . . ,X„} denote the set of explanatory variables. 

Theorem 1. If Assumptions 1 and 2 hold, then g{x) g[x) in mean 
square, conditional on X , and 

(11) sup E[{g{x)-g{x)f\X]=o^{l). 

g&g{C,x,a) 

Furthermore, for all r] >0, 

sup P{\g{x) - g{x)\>r]} ^0. 
g&g(C,x,a) 

Moreover, if h is chosen to decrease to zero in such a manner that 

(12) /i2"P(||X-2:|| <ci/i)xn~^ 

as n ^ CO, then, for each C > 0, the rate of convergence of g{x) to g{x) 
equals Op{h'^°'), uniformly in g ^ Q{C,x,a): 

(13) sup E[{g{x) - g{x)Y \X] = Op{h^^), 

g€g{C,x,a) 

(14) lim limsup sup P{\g{x) - g{x)\ > Cih"'} = 0. 

To interpret (11) and (13), assume that the pairs {Xi,ei), for 1 < i < oo, 
are all defined on the same probability space, and then put Yi = Yi{g) = 
g{Xi) + £i. Write Eg{-\X) to denote expectation in the distribution of the 
pairs {Xi,Yi{g)), conditional on X. In Section 5.1 below, we shall discuss 
appropriateness of conditions such as (12), which relate to "small ball prob- 
abilities." Asymptotic consistency results for g and mean squared errors have 
been derived in Ferraty, Mas and Vieu (2007) under different assumptions. 
The convergence rate at (14) is optimal in the following sense. 

Theorem 2. If the error e in (1) is normally distributed, and if, for 
a constant ci > 0, nP[\\X — x\\ < cih) — > oo and (12) holds, then, for any 
estimator g{x) of g{x), and for C > sufficiently large in the definition of 
Q{C,x,a), there exists a constant Ci > 0, such that 

limsup sup P{\g{x) - g{x)\ > Cih'^} > 0. 

n.^oo geg(C,x,a) 

According to this result, uniformity of the convergence holds over the 
Lipschitz class of functionals Q{C,x,a). This result applies for a fixed argu- 
ment X in the domain of the predictor functions, where the functionals are 
evaluated. Further discussion of the bounds on -Pdl-'^ — x\\ < u) as relevant 
for (12) is provided in Section 5.1. 
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3.2. Consistency of derivative estimator. We shall establish consistency 
of the estimator ■jxj- To this end, let 

denote the version of Qi2j when ^j^jj is replaced by the quantity that ^j^jj 
approximates, and let h-^i^j denote the version of K{ii,i2, j\x), defined at 
(8), when Qij^i2j there is replaced by qi^i2j- 



Assumption 3. 

(a) supigi^{X(t)4}<oo; 

(b) there are no ties among the eigenvalues ^i, . . . ,6*^+1; 

(c) \g{x + y)- g{x) - g^y\ = o{\\y\\) as \\y\\ 0; 

(d) the distribution of ^ij — has a well-defined density in a neighborhood 
of the origin, not vanishing at the origin; 

(e) K is supported on [0, 1], nondecreasing and with a bounded derivative 
on the positive half-line, and < K{0) < 00; and 

(f) /ii,/i2 — > as n increases, sufficiently slowly to ensure that n^^'^ mm{hi, 
/12) 00 and {nhiY E{ki^i2j) — > 00. 



Finite variance of X guarantees that the covariance operator V , leading to 
the eigenfunctions tpj and their estimators ipj in Section 3.1, is well defined; 
and finite fourth moment, stipulated by Assumption 4(a), ensures that Wijjj — 
ipjW converges to zero at the standard root-?i rate. This assumption is, for 
example, satisfied for Gaussian processes with smooth mean and covariance 
functions. 

If we suppose, in addition, that X is a process with independent principal 
component scores / Xif^j (or the stronger assumption that X is Gaussian) 
and all the eigenvalues 9j are nonzero [we shall refer to these properties 
jointly as (Pi)], then Assumption 3(f) implies that = 0{hj) for j = 1,2 
and for all e > [call this property (P2)]- That is, both bandwidths are of 
larger order than any polynomial in n~^. To see why, note that (Pi) entails 
P{\\x-X\\ < hi) = 0(/if^) for ah Ci > 0. Also, 3(f) implies that nhiP{\\x- 
X\\ < C2hi) —f 00 for some C2 > 0, and this, together with (Pi), leads us to 
conclude that n/i'^^^^ — > 00 for all Ci > 0. That result is equivalent to (P2) 
for the bandwidth hi. Property (Pi) also implies that P{qi2j l£ ^2) = 0(/i^^) 
for all Ci > 0, and 3(f) implies that nP{qi2j < C2/i2) 00 for some C2 > 0, 
which, as before, leads to (Pi), this time for the second bandwidth. 



Theorem 3. If Assumption 3 holds, then j^j ~^lxj in probability. 
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Using notation (5), if e = Z^jLi ^ji^j with J2j ~ ^ Jo < OO; the func- 
tional directional derivative in direction e at x is gxe = J2j ^j^xj] see also (6), 
where e is obtained by choosing aj = ej,l < j < jo,aj = 0,j > Jo- If Assump- 
tion 3 holds for all j < jo , it is an immediate consequence of Theorem 3 that 
the estimated functional derivative = J2j ^jlxj at x in direction e is con- 
sistent (i.e., satisfies (jx^^ 9x6 in probability). As this holds uniformly over 
all direction vectors e, the functional gradient field for directions anchored 
in the span of {"01, . . . ,4^ jo} can be estimated consistently. 

If we take the operator Qx, defined by Qxa = J2j<r^xjCij (where r > 1 is 
an integer and a = J2j is function), to be an empirical approximation 
to gx, the operator given by gxa = J2j IxjCij, if the conditions in Assumption 
3 hold for each j, and in addition J2j Ixj < then there exists a (generally 
unknown) deterministic sequence r = r{n,x) with the following properties: 
r{n,x) — > oo as n — > oo; whenever ||a|| < oo, gxa — gxa — > in probability; 
and moreover, gx — > gx in norm as n ^ oo, where the convergence is again in 
probability. An explicit construction of such a sequence r(n,x), and thus of 
an explicit estimate of the derivative operator with these properties, would 
require further results regarding the convergence rates for varying j in The- 
orem 3, and remains an open problem. 

4. Application of functional derivative estimation to growth data. The 

analysis of growth data has a long tradition in statistics. It played a pio- 
neering role in the development of functional data analysis, as evidenced by 
the studies of Rao (1958), Gasser et al. (1984), Kneip and Gasser (1992), 
Ramsay and Li (1998) and Gervini and Gasser (2005) and remains an active 
field of statistical research to this day. 

We explore the relationship between adult height, measured at age 18 
(scalar response), and the growth rate function observed to age 10 (func- 
tional predictor), for 39 boys. Of interest is the following question: how do 
shape changes in the prepubertal growth velocity curve relate to changes in 
adult height? Which changes in the shape of a prepubertal growth veloc- 
ity curve of an individual will lead to the largest adult height gain for an 
individual? These and similar questions can be addressed by obtaining the 
functional gradient of the regression of adult height versus the prepubertal 
growth velocity trajectory. Such analyses are expected to provide us with 
better understanding of the intricate dynamics and regulatory processes 
of human growth. Functional differentiation provides an excellent vehicle 
for studying the effects of localized growth velocity changes during various 
stages of prepubertal growth on adult height. 

For this exploration, we use growth data for 39 boys from the Berke- 
ley longitudinal growth study [Tuddenham and Snyder (1954)], where we 
include only measurements obtained up to age 10 for the growth veloc- 
ity predictor processes. The 15 time points before age 10 at which height 
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measurements are available for each boy in the Berkeley study correspond to 
ages {1, 1.25, 1.5, 1.75, 2, 3, 4, 5, 6, 7, 8, 8.5, 9, 9.5, 10}, denoted by {sj}j=i,...,i5. 
Raw growth rates were calculated as first order difference quotients Xij = 
(/ijj-i-i — hij)/{tj-\-i — tj), where hij are the observed heights at times sj for 
the ith boy, and tj = {sj + Sj+i)/2, i = I, . . . , 39, j = 1, . . . , 14. These raw 
data form the input for the computation of the functional decomposition 
of the predictor processes into mean function, eigenfunctions and functional 
principal component scores. To obtain this decomposition, we used an imple- 
mentation of the functional spectral methods described in Yao et al. (2003) 
and Yao, Miiller and Wang (2005a). 

Applying a BIC type criterion based on marginal pseudo-likelihood to 
choose the number of components in the eigenrepresentation, three compo- 
nents were selected. The resulting smooth estimates of fitted individual and 
mean growth velocity curves are shown in Figure 1. The first three compo- 
nents explain 99.5% of the total variation (78.9%, 17% and 3.6%, resp.), and 
the corresponding estimated eigenfunctions are displayed in the left panel 
of Figure 2. The first eigenfunction corresponds to a rapid initial decline in 
growth velocity, followed by a relatively flat increase with onset around age 
5 toward the right end of the considered age range, while the second eigen- 
function contains a sign change and provides a contrast between growth 
rates after age 2 and those before age 2. The third eigenfunction describes a 
midgrowth spurt around ages 6-7, coupled with an especially rapid decline 
in growth rate before age 3. 

To visualize the estimated functional derivatives, a derivative scores plot 
as shown in the right panel of Figure 2 is of interest. The coefficient estimates 
for the first two eigendirections are plotted [i.e., the points, defined at (5), 
(7Xi,i, 7x,,2)) evaluated at each of the 39 predictor functions Xi]. This figure 
thus represents the canonical functional gradient vectors at the observed 
data points, truncated at the first two components. These gradient vectors 
are seen to vary quite a bit across subjects, with a few extreme values present 
in the derivative corresponding to the first eigendirection. 

The gradients are generally positive in the direction of the first eigen- 
function and negative in the direction of the second. Their interpretation is 
relative to the shape of the eigenfunctions, including the selected sign for 
the eigenfunctions (as the sign of the eigenfunctions is arbitrary) . If the gra- 
dient is positive in the direction of a particular eigenfunction ipj, it means 
that adult height tends to increase as the corresponding functional principal 
component score increases. So, in order to interpret the gradients in the 
right panel of Figure 2, one needs to study the shapes of the corresponding 
eigenfunctions as depicted in the left panel. When observing the shapes of 
first and second eigenfunction in the left panel of Figure 2, adult height is 
seen to increase most if the growth velocities toward the right end of the 
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2468 10 2468 10 

Time (years) Time (years) 

Fig. 1. Fitted trajectories for individual predictor growth velocity curves (left panel) and 
mean growth velocity curve (right panel) for the Berkeley growth data (n — 39 ). 

domain of the growth rate, predictor curves are larger, a result that is in 
line with what one would expect. 

Using the first K components, we define functions g*it) = J2f=i lXij'4'j{t) 
for each subject i. Then, for any test function z{t) = Y^f=i Zjil)j{t) with H^H = 

1, one has / g*{t)z{t) dt = J2f=i7x^,jZj , so that the functional directional 
derivative at Xj in direction z is obtained through an inner product of z 
with g* . We therefore refer to g^ as the derivative generating function at 
Xj. In the application to growth curves, we choose K = 3 and this function 
can be interpreted as a subject-specific weight function, whose inner product 
with a test function z provides the gradient of adult height when moving 
from the trajectory Xi in the direction indicated by z. It is straightforward 
to obtain estimates 

K 

(15) 9:{t)=J2'fx,Ait) 

of these derivative generating functions by plugging in estimates for 'jx^j 
and ijj{t) as obtained in (3) and (7). 
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Fig. 2. Smooth estimates of the first three eigenfunctions for the velocity growth curves, 
explaining 78.9% (solid), 17% (dashed) and 3.6% (dash-dotted) of the total variation, 
respectively (left panel) and estimated functional derivative coefficients (7Xi,i, 7x^,2 ) (7), 
in the directions of the first (x-axis) and second (y-axis) eigenf unction, evaluated at the 
predictor curves Xi (dots), as well as at the mean curve fi (circle) (right panel). 



Estimated derivative generating functions g* for K = 3 are depicted in 
Figure 3 for all 39 trajectories Xi in the sample. These empirical deriva- 
tive generating functions are found to be relatively homogeneous. Estimated 
functional directional derivatives in any specific direction of interest are then 
easily obtained. We find that gradients are largest in directions z = g* /\\gi\\ 
(i.e., in directions that are parallel to the derivative generating functions 
g^). This means that largest increases in adult height are obtained in the 
presence of increased growth velocity around 2-4 years and past 8 years, 
while growth velocity increases between 5-7 years have only a relatively 
small effect. 

It is of interest to associate the behavior of the derivative operators with 
features of the corresponding predictor trajectories. The predictor trajecto- 
ries Xi , for which the derivative coefficients 7x ^ ,j have the largest and small- 
est absolute values in each of the first three eigendirections (for j = 1,2,3), 
are depicted in the upper panels of Figure 4. The lower panels show the 



12 



P. HALL, H.-G. MULLER AND F. YAO 



corresponding derivative generating functions. One finds that the functional 
gradients of growth velocity curves that contain time periods of relatively 
small growth velocity are such that increased growth velocity in these time 
periods is associated with the largest increases in subsequent adult height 
(dashed curves in left and middle panel, dotted curve in right panel), as does 
slowing of above-normal high post-partum growth velocities (dashed curve 
in right panel). 

A systematic visualization of the connection of predictor functions and 
the gradient field, as represented by the derivative generating functions, is 
obtained by considering families of predictor trajectories X{t;aj) = fl{t) + 
ajipj{t) that move away from the mean growth velocity trajectory in the di- 
rection of a specific eigenfunction, while the other eigenfunctions are ignored, 
as shown in the upper panels of Figure 5 for the first three eigenfunctions. 
The corresponding derivative generating functions are in the lower panels. 
This visually confirms that adult height gains are associated with increased 
growth velocities in those areas where a subject's velocities are relatively 




1 23456789 10 

Time (years) 



Fig. 3. Estimated derivative generating functions gi{t) (15) for all subjects Xi (black) 
and for the mean function (red) of the Berkeley growth data, based on the first three 
eigenfunctions. 
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low, especially toward the right end of the domain of the velocity predictor 
curves. 

As the sample size in this example is relatively small, it is clear that 
caution needs to be exercised in the interpretation of the results of this 
data analysis. The results presented here follow the spirit of exploratory 
data analysis. We find that the concept of functional derivatives can lead 
to new insights when analyzing functional data which extend beyond those 
available when using established functional methods. Many practical and 
theoretical issues require further study. These include, for example, choice 
of window widths and the estimation of functional derivatives for data that 
are irregularly or sparsely measured. 

5. Additional results and proofs. 




w I — , , 1 1 1 cn I — I 1 , . 1 cn 

c c c 

O / X o o 




Q _4 t — . . . . 1 Q _4 1 — . . . , 1 Q _4 1 — . . . , 1 

2468 10 2468 10 2468 10 

Time (years) Time (years) Time (years) 



Fig. 4. Predictor trajectories (top panels) and corresponding derivative generating func- 
tions g*(t) (15) (bottom panels) which have the largest (dashed) and smallest (dotted) 
absolute values of derivative coefficients ^xj (7) in the directions of the first (j = 1, left), 
second (j = 2, middle) and third (j = 3, right) eigenf unctions, as well as the mean func- 
tions (solid). 
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2468 10 2468 10 2468 10 

Time (years) Time (years) Time (years) 



Fig. 5. Top: predictor trajectories X{t;aj) = ji{t) + ajtpj{t) with aj — —2 (dashed), 
(solid), +2 (dotted), where j = 1,2,3 from left to right. Bottom: corresponding derivative 
generating functions (15). 



5.1. Bounds on P{\\X — x\\ < u) . Reflecting the infinite-dimensional na- 
ture of functional-data regression, the rate of convergence of the "small ball 
probabilities" -P(||^ — x|| < u) to zero as ti — > is generally quite rapid; in 
fact, it is faster than any polynomial in u. See (19) below. In consequence, 
the convergence rate of g{x) to g{x) can be particularly slow. Indeed, unless 
the Karhunen-Loeve expansion of X is actually finite dimensional, the rate 
of convergence evidenced by (14) is slower than the inverse of any polynomial 
in n. 

The fastest rates of convergence arise when the distribution of X is clos- 
est to being finite dimensional; for example, when the Karhunen-Loeve ex- 
pansion of X can be written as X = J2jCj''Pjj where var(^j) = 9j and the 
eigenvalues 9j,j > 1, decrease to zero exponentially, rather than polynomi- 
ally, fast as j increases, where the are uncorrelated. Therefore, we shall 
focus primarily on this case and require the following. 
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Assumption 4. For constants 3,(3 >0, 
(16) logOj = -Bf + o{f) as J ^ oo, 

1/2 

and the random variables rjj = ^j/Oj are independent and identically dis- 
tributed as rj, the distribution of which satisfies 

Biv!' < P{\ri\ <u) < B2U^ for all sufficiently small u > 0, and 

(17) 

P{\r]\>u)<B3{l+u)-^^ forann>0, where Bi, ^4, 0. 

Take x = 0, the zero function, and, with b, B and /3 as in (16) and (17), 
define 

(18) -(n)=exp{-^(|)'^'jlogn|(^+i)//^}. 

Theorem 4. // (16) and (17) hold, then, with 7r{u) given by (18), 

(19) P(||X|| <n) = ^(n)^+°(^) asnjO. 

Combining Theorems 1 and 3, we deduce that, if the eigenvalues Oj de- 
crease as indicated at (16), if the principal components have the distri- 
butional properties at (17), and if the bandwidth h is chosen so that (12) 
holds, then the kernel estimator g{x) converges to g{x) at the mean-square 
rate of 

= exp(-2a|log/i|) 

.exp[-{l.„,l,}2„(^?±i)^"^^-(f)""^^"„„,„)^/<.«.]. 

For each fixed /3, this quantity decreases to zero more slowly than any 
power of n~^, although the rate of decrease increases as (3 increases. A 
typical example where conditions (16) and (17) are satisfied is that of a 
process where 9j = exp(— i?j^), where the distribution of rj in Assumption 4 
has a bounded nonzero density in a neighborhood of the origin, and where 
is the standard Fourier series. In this case, one finds that (3 = b = l and 

7r(ii) = exp{— c(logu)('^+"^)/^} = u~^^^°^^''^^^'^ , for some c > 0, corresponding 
to faster than polynomial convergence toward 0. Of course, the condition on 
the distribution of 77 is satisfied if the process X is Gaussian. 

Theorem 4 establishes that, in the case j; = 0, the probability P(||X — x|l < 
u) typically does not vanish, even for very small u, and, in this context, (19) 
gives a concise account of the size of the probability. If we take x = and 
replace X by Xi — X2, for which the calculations leading to (19) are identical 
in all essential respects to those leading to (19), then we obtain a formula for 
the average value ofP(||Xi— x|| < n) over all realizations x of X2 . Therefore, 



16 



P. HALL, H.-G. MULLER AND F. YAO 



(19) provides substantially more than just the value of the probability when 

X = 0. The case of fixed but nonzero x, where x = J2j ^y^^j ^^'^ ^i'^ 
are uniformly bounded, can be treated with related arguments, and also 
the setting where the unbounded, although it needs more detailed 

arguments. 

If 9j decreases to zero at a polynomial rate, rather than at the expo- 
nential rate stipulated by (16), then the probability -P(||X — x\\ < u) de- 
creases to zero at rate exp(— Ciu"*-^^) as u decreases to 0, rather than at the 
rate exp(— Ci | logup^) indicated by Theorem 3 for constants Ci,C2 > 0. 
Very accurate results of this type, in the case where x = 0, are given by 
Gao, Hannig and Torcaso (2003), who also provide additional relevant ref- 
erences. It is noteworthy that these results also pertain to non-Gaussian 
processes, while early results along these lines for Gaussian processes can 
be found in Anderson and Darling (1952). Decay rates of the closely re- 
lated type M*^^ exp(— Ciu~'^2) for C3 > were featured in Ferraty, Mas and 
Vieu (2007), among several other rates that are primarily associated with 
finite-dimensional processes. 

We conclude from this discussion that the decay rates of the small ball 
probabilities are intrinsically linked to the decay rates of the eigenvalues of 
the underlying process. The fast decay rates associated with polynomially 
converging eigenvalues mean that this case is not particularly desirable from 
a statistical point of view. 

5.2. Proof of Theorem 1. Let o"^ denote the variance of the error e in 
(1). Set Nj = J2i Ki{xy for j = 1, 2, and note that N2 < K{0)Ni, as K{-) is 
nonincreasing and compactly supported on [0,c]. Therefore, 

E[{9{x)-g{x)}'\X] 

[E{g{x)\A:} - g{x)]' +vav{g{x)\X) 

<^'E^Kf{x) 



(20) 



< max \g{Xi) - g{x)\I{\\Xi - x\\ <ch) + 

i-L,...,n iZ^j-f^iV^JI 

< sup \g{x) - g{x + y)\ H . 

y '■ ll'i/ll "^1 

Continuity of g at x [i.e., (9)] implies that the first term on the right-hand 
side of (20) converges to zero. Note that Ki{x) > Ki{x)I{\\Xi — x\\ < cih) > 
K{ci)I{\\Xi - x\\ < cih), where ci is as in (A2). Then, (10) entails iVf ^ 
with probability 1, and by monotone convergence E{Nr^) ^0. Together 
with (20), these properties imply the first part of the theorem. The second 
part, comprising (13) and (14), is obtained on noting that (20) entails 

sup E[{g{x) - gix)}'\X] < C\chf^ + 

g&g{C,x,a) ^^1 
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yea) ^ j^^c^^nP{\\X-x\\<cih) 
and E{N{^) < E[{J2^I{\\X^ - x\\ < cih)}^^] x {nPi\\X - x\\ < cih)}~^. 

5.3. Proof of Theorem 2. Without loss of generality, x = 0. Let / denote 
a function defined on the real line, with a derivative bounded in absolute 
value by Bi, say, supported only within the interval [—B2,B2], and not 
vanishing everywhere. Then, / itself must be uniformly bounded, by B^, say. 
Define gi = and 52(2/) = If ll^ll ^ ^ then, since < a < 1, 

152(2/) - 52(0)1 = h-\fm\/h) - /(0)| < h'^Bi\\y\\/h < h^BM/hr 

= Bi\\yr, 

while, if ||y|| > h, 

l52(y)-52(0)|<2/i"i?3<2i?3||yr. 

Therefore, (72 S G{C,0,a) provided mayi{Bi,2Bs) < C. 

The theorem will follow if we show that, in a classification problem where 
we observe n data generated as at (1), with the errors distributed as Normal 
A^(0, 1) and g = gi or g2, with prior probability ^ on either of these choices, 
the likelihood-ratio rule fails, in the limit as n ^ 00, to discriminate between 
gi and g2- That is, with Yi = Si (the result of taking g = gi in the model), 
and with p defined by 

_ niexp[-l/2{y,-gi(X,)p] 
^ niexp[-l/2{K,-52(X.)}2]' 

we should show that 

(21) P(p > 1) is bounded below 1 as n — > cx). 

Now, 

n 

2logp = Y,{g2iX,)^ -2eig2{Xi)}, 



i=l 



which, conditional on X, is normally distributed with mean = J2i 92{Xi)'^ 
and variance 4s^. Therefore, (21) holds if and only if 

(22) lim limsupP(4 > B) = 

and so we can complete the proof of Theorem 2 by deriving (22). 

If we choose the radius B2 of the support of / so that < B < ci, then 
|fi'2(2;)| < i?3/i°/(||x|l < ci/i), in which case 

n 

(23) sl<Blh'^J2l(\\X^\\<^^h). 
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Since, by assumption, nP(||X|| <cih)^oo, then 

J:^I{\\X^\\<C,h) 

nP{\\X\\ <cih) 

in probability. This property, (12) and (23) together imply (22). 

5.4. Proof of Theorem 3. Write, simply, Ki^i^j for K{ii,i2,j\x). Assump- 
tion 3(e) implies that 



(24) 



Ki^i^j = 0, unless each of the following holds: \\Xi^ ~ x\\ < /ii, 
- x\\ < hi and Qi^i^ < /i2- 



Given (5 > 0, let s{6) equal the supremum of \g{x + y) — g{x) — gxy\ over 
functions y with ||y|| <6. Then, by Assumption 3(c), 

(25) 6~^s{S)^0 as^iO. 

Write £i-i^i2 for the event that \\Xii^ — x\\ < hi for = 1,2. If (Sj^^j holds, 

\g{X,,) - g{Xi,) - g^iX,, - Xi,)\ <2s{hi). 

Therefore, defining e^^jj = e^^ — and assuming Si^^i^, 

\Yi, -Yi, - {g,(Xi, - Xi,) + e^,i,}\ < 2s{hi). 

Hence, defining ii^i^j = ^i^j - ^i^j, noting that 5x(^n - Xi^) = 'Ek^hhklxk, 
and using (24), we have. 



(26) 



^ (j) oo (j) ^ 

EE 

V «l,«2 k=l ii,i2 ' 



(i) 



<2s{hi)J2J2K' 



J112J • 



«1,«2 



Now, 



(27) 



le 



«1«2J S«ll2J 



< \\Xi, - X,, II llV'i - II < 2/ii II V^,- - II, 
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where the last inequahty holds under the assumption that the event fj^jj 
obtains. Combining (24), (26) and (27), we deduce that 



(28) 



5Z $Z (■^i ^2)-^n«2i 

~ ( Txj ^ ^ iiii2jK'> 
\ 11,12 



(j) 

11,12 k : k^j 



0) \ 

+ 5Z $Z ^H*2-^n«2j 1 
n,i2 / 



(i) 



n,«2 



Note, too, that 

0-) 



5Z $Z -^n*2i E/ ^iii2klxk 
11,12 k : k^j 

(i) 



= E E ^ii»2i E ^^fc / ~ ^^2)^-^ 

jl,i2 k:k^j 

(i) / X l/2r c . 

(29) <EE^n.2i( E ^Ik) [ E [j{X,i-X,,)i^, 



11,12 



iJ) 



'■«1«2J 



<ii5.iiEE^^ 

n,*2 
(i) 

^ llfx|lEE^»ii2i 

«1,«2 



(i) 



ll^n - 



«2 I 



1/2 



1/2 



+ 8||^,-Vill||^n-^, 



^2 I 



1/2 



< 2||5.||/ii E E ^n22.(fti.2.- + 8|l^.- - i^jWf' 



«1,«2 



(i) 



< 2ibxii/ii(/i2 + 8|ivi, - V'iiD'/'E E 



«1«2J- 
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To obtain the third-last inequality in (29), we used the fact that, with a ■ 
\J{Xi^ -XjJVjl, b=\J{Xi^ -Xi2)'tpj\ and 



(30) 



c=\\Xi, -XiJ\\ijj-i;j\\<2\\Xi, -XiJ<4hu 



where [in each of (30) and in (31) below] the last inequality is correct pro- 
vided Si^i^ holds, we have used the fact that |a — 6| < c and \a\ < \\Xi-^ — Xi^ \\ 
imply that 



(31) 



\a^ -h^\ <c(2a + c) <4||'(/;j - Vjllll^ii -X 



*2 I 



<8|| 



To obtain the last inequality in (29), we used (24) and the fact that Qi^i^j < 
/i2 if Ki^i^j 7^ 0. 

Combining (28) and (29), we find that 



(i) 



«2J 



(32) 



\ 21,12 n,«2 / 

< 2hi{h^^s{hi) + |7^.j I Wi^j - ijj II 



+ ll5.||(^2+8||Vi,-Vill)'/nEE^n 



«2J- 



Result (32) controls the numerator in the definition of jxj at (7). To control 
the denominator there, use (27) to show that 

(i) (j) 
E E ^iii2jKni^j >J2J2 max(0, - Ci^j - 2/ii \\iPj - IDKi^i^j 
ii,i2 n,«2 



(33) 



-EE max(0, ^iij - ii2j)Ki^ 



«2J 



il,l2 



(i) 



2/iiii'0i-V'iiiEE^»ii2i- 



n,«2 



[Recall that Yl,Yl^iii2 denotes summation over (^1,^2) such that ii-^i^j > 0.] 
Using Assumption 4(d), (e) and (f), it can be proved that, for a constant 
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B>0, 

U) (i) 
(•^4) H max(0,Ciij - Ci2j)Ki^i2j > {1 + Op{l)}Bhi ^ ^ i^niaj- 

[Note that, by Assumption 3(f), n~^/^/ min(/ii, /i2) — > 0.] From Assumption 
3(a) and (b), it follows that 

(35) ||V;,-V'ill = 0p(n-i/2). 
Together, (33)-(35) imply that 

(i) (i) 

(36) E E ih^2JK^,^,j > {1 + Op(l)}i?/ll E E ^h^2j 

for the same constant B as in (34). This result controls the denominator at 
(7). 

Prom (7), (25), (32) and (36), we deduce that 

(37) = + Op , ^rj) + Op(l). 

The variance of the ratio on the right-hand side of (37), conditional on the 
explanatory variables Xj, equals 

Opj {^\ E E ^n.2.) I = Op{{{nhrfE{h,,,,)Y^\ = Op(l), 

where, to obtain the last identity, we used Assumption 3(f). Therefore, (37) 
implies that = 7xj + Op(l), which proves Theorem 3. 

5.5. Proof of Theorem 4- Observe that, for each t £ (0, 1) and with Df = 



(38) p{\\x\\<u) = p[Y,ei7f^< 

\j=i 



<\{P{9,r^]<u% 

oo 

>n^(^k'<A«'), 



where, to obtain the lower bound, we used the property 

Coo \ f oo 

E^,,J < n^j = p\Y:fr\^H - Dtu') < 0^ 

> P{0]ril < Dtu^ for each j). 
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1/2 

Define J = J{u) to be the largest integer such that u/Oj < C, where C is 
chosen so small that Biu^ < P{\r]\ <u) < B2U^ for < u < C- Then, 

oo J 

i=i i=i 



exp|i&Bf:/ + o(j/^+^)| 



(39) 

as u I 0, where vr is defined at (18). 

1 /2 t ll 

Redefine J to be the largest integer such that "u/^j ^ C- Then, using 
the argument leading to (39), we may show that 

rip(0*^|<An2) 

(40) = exp{--^ (-|) '^'l logn|(^+^)/'^ + o{\ \oguf^')'P] 

Also, for j > J + 1, 

(41) TT, = P(0*7?2 > ^^^2) < ^ {D]/\/ef)r^\ 

Note, too, that, for a constant i?5 = -B5(t) E (0,1), we have VTj G (0,-65) for 
J > J + 1, and 

1 - TTj = exp ( - ^ ^ ] > exp(-S67rj) 
from which it follows that 



.=1 ^ 



n (l-vr,)>exp -i?6 E >exp E (^f/n)^4, 

j=j+i V i=J+i / I j=J+i J 

which is of smaller order than the right-hand side of (40). Combining this 
result with (40), and noting that t S (0,1) on the right-hand side of (40), 
can be taken arbitrarily close to 1, we deduce that, as u J, 0, 

00 

(42) n ^(^1^1 ^ Dtu^) = 

Together, (38), (39) and (42) imply (19). 
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